Supervised or unsupervised & model types

Peer Herholz (he/him)
Postdoctoral researcher - NeuroDataScience lab at MNI/McGill, UNIQUE
Member - BIDS, ReproNim, Brainhack, Neuromod, OHBM SEA-SIG

logo logo   @peerherholz

logo

Aim(s) of this section

  • learn about the distinction between supervised & unsupervised machine learning

  • get to know the variety of potential models within each

Outline for this section

  1. supervised vs. unsupervised learning

  2. supervised learning examples

  3. unsupervised learning examples

A brief recap & first overview

  • let’s bring back our rough analysis outline that we introduced in the previous section

logo
  • so far we talked about how a Model (M) can be utilized to obtain information (output) from a certain input

  • the information requested can be manifold but roughly be situated on two broad levels:

    • learning problem

      • supervised or unsupervised

    • specific task type

      • predicting clinical measures, behavior, demographics, other properties

      • segmentation

      • discover hidden structures

      • etc.

logo

https://scikit-learn.org/stable/_static/ml_map.png


Learning problems - supervised vs. unsupervised

logo
  • if we now also include task type we can basically describe things via a 2 x 2 design:

logo

Our example dataset

Now that we’ve gone through a huge set of definitions and road maps, let’s move away from these rather abstract discussions to the “real deal”, i.e. seeing how these models behave in the wild. For this we’re going to sing the song “Hello example dataset my old friend, I came to apply machine learning to you again.”. Just to be sure: we will again use the example dataset we briefly explored in the previous section to showcase how the models we just talked about can be put into action, as well as how they change the questions we can address and how we have to interpret the results.

At first, we’re going to load our input data, i.e. X again:

import numpy as np

data = np.load('MAIN2019_BASC064_subsamp_features.npz')['a']
data.shape
(155, 2016)
  • just as a reminder: what we have in X here is a vectorized connectivity matrix containing 2016 features, which constitutes the correlation between brain region-specific time courses for each of 155 samples (participants)

  • as before, we can visualize our X to inspect it and maybe get a first idea if there might be something going on

import plotly.express as px
from IPython.display import display, HTML
from plotly.offline import init_notebook_mode, plot

fig = px.imshow(data, labels=dict(x="features", y="participants"), height=800, aspect='auto')

fig.update(layout_coloraxis_showscale=False)
init_notebook_mode(connected=True)

#fig.show()

plot(fig, filename = 'input_data.html')
display(HTML('input_data.html'))
  • at this point we already need to decide on our learning problem:

    • do we want to utilize the information we already have (labels) and thus conduct a supervised learning analysis to predict Y?

    • or do we not want to utilize that information and thus conduct an unsupervised learning analysis to e.g. find clusters or decompose X?

  • please note: we only do this for the sake of this workshop! Please never take this kind of “Hm, maybe we do this or that, let’s see how it goes.” approach in your research. Always make sure you have a precise analysis plan that is informed by prior research and guided by the possibilities of your data. Otherwise you’ll just add to the ongoing reproducibility and credibility crisis, hindering rather than accelerating scientific progress. (The other option is to conduct exploratory analyses and simply be honest about it, not presenting them as if they were confirmatory.)

  • that being said: we’re going to test basically all of them (talk about “not practising what one preaches”, eh?), again, solely for teaching reasons

  • we’re going to start with supervised learning, thus using the information we already have

Supervised learning

  • independent of the precise task type we want to run, we initially need to load the information, i.e. labels, available to us:

import pandas as pd
information = pd.read_csv('participants.csv')
information.head(n=5)
participant_id Age AgeGroup Child_Adult Gender Handedness
0 sub-pixar123 27.06 Adult adult F R
1 sub-pixar124 33.44 Adult adult M R
2 sub-pixar125 31.00 Adult adult M R
3 sub-pixar126 19.00 Adult adult F R
4 sub-pixar127 23.00 Adult adult F R
  • as you can see, we have multiple variables, i.e. labels, describing our participants (samples), and almost every one of them can be used to address a supervised learning problem (e.g. Child_Adult)

logo
  • goal: Learn parameters (or weights) of a model (M) that maps X to y

  • however, while some are categorical and thus could be employed within a classification analysis, some are continuous and thus would fit within a regression analysis (e.g. Age)

  • we’re going to check both

Supervised learning - classification

logo
  • goal: Learn parameters (or weights) of a model (M) that maps X to y

  • in order to run a classification analysis, we need to obtain the respective categorical labels and define them as our Y

Y_cat = information['Child_Adult']
Y_cat.describe()
count       155
unique        2
top       child
freq        122
Name: Child_Adult, dtype: object
  • we can see that we have two unique values, but let’s plot the distribution just to be sure and maybe spot something important/interesting:

fig = px.histogram(Y_cat, marginal='box', template='plotly_white')

fig.update_layout(showlegend=False)
init_notebook_mode(connected=True)

#fig.show()

plot(fig, filename = 'labels.html')
display(HTML('labels.html'))
  • that looked about right and we can continue with our analysis

  • to keep things easy, we will use the same pipeline we employed in the previous section, that is we will scale our input data, train a Support Vector Machine and test its predictive performance:

from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
pipe = make_pipeline(
       StandardScaler(),
       SVC())

A bit of information about Support Vector Machines:

  • non-probabilistic binary classifier

    • samples are in one of two classes

  • utilization of hyperplane as decision boundaries

    • n feature dimensions - 1

  • support vectors

    • small vs. large margins

logo

Pros

  • effective in high dimensional spaces

    • Still effective in cases where number of dimensions is greater than the number of samples.

  • uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.

  • versatile: different Kernel functions

Cons

  • if number of features is much greater than the number of samples: danger of over-fitting

    • make sure to check kernel and regularization

  • SVMs do not directly provide probability estimates
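The two cons above translate directly into estimator parameters. A minimal sketch of the relevant knobs on scikit-learn's SVC (the values shown are simply its defaults):

```python
from sklearn.svm import SVC

# kernel and the regularization strength C are the main levers against
# over-fitting when features vastly outnumber samples
clf = SVC(kernel='rbf', C=1.0)

# probability estimates are not provided directly; they require an extra,
# internally cross-validated calibration step enabled via probability=True
clf_proba = SVC(probability=True)

print(clf.kernel, clf.C, clf_proba.probability)
```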

  • before we can go further, we need to divide our input data X into training and test sets:

X_train, X_test, y_train, y_test = train_test_split(data, Y_cat, random_state=0)
  • and can already fit our analysis pipeline:

pipe.fit(X_train, y_train)
Pipeline(steps=[('standardscaler', StandardScaler()), ('svc', SVC())])
  • followed by testing the model’s predictive performance:

print('accuracy is %s with chance level being %s' %(accuracy_score(pipe.predict(X_test), y_test), 1/len(pd.unique(Y_cat))))
accuracy is 0.8974358974358975 with chance level being 0.5

(spoiler alert: can this be right?)
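One reason to pause before celebrating: our classes are imbalanced (122 of the 155 samples are children), so the naive 1/len(classes) “chance level” is misleading. A minimal sketch of a more honest baseline, using scikit-learn's DummyClassifier on stand-in labels with our class counts:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# stand-in labels mirroring our class counts: 122 'child', 33 'adult'
y = np.array(['child'] * 122 + ['adult'] * 33)
X = np.zeros((155, 1))  # features are irrelevant for this baseline

# always predicting the most frequent class already scores 122/155
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X, y)
print(dummy.score(X, y))  # 122/155 ≈ 0.787, well above the naive 0.5
```

Any real model thus has to beat ~0.79, not 0.5, to demonstrate it learned anything from X.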

Supervised learning - regression

  • after seeing that we can obtain a super high accuracy using a classification approach, we’re hooked and want to check if we can get an even better performance by addressing our learning problem via a regression approach

  • for this to work, we need to change our labels, i.e. Y from a categorical to a continuous variable:

information.head(n=5)
participant_id Age AgeGroup Child_Adult Gender Handedness
0 sub-pixar123 27.06 Adult adult F R
1 sub-pixar124 33.44 Adult adult M R
2 sub-pixar125 31.00 Adult adult M R
3 sub-pixar126 19.00 Adult adult F R
4 sub-pixar127 23.00 Adult adult F R
  • here Age seems like a good fit:

Y_con = information['Age']
Y_con.describe()
count    155.000000
mean      10.555189
std        8.071957
min        3.518138
25%        5.300000
50%        7.680000
75%       10.975000
max       39.000000
Name: Age, dtype: float64
  • however, we are of course going to plot it again (reminder: always check your data):

fig = px.histogram(Y_con, marginal='box', template='plotly_white')

fig.update_layout(showlegend=False)
init_notebook_mode(connected=True)

#fig.show()

plot(fig, filename = 'labels.html')
display(HTML('labels.html'))
  • the only thing we need to do to change our previous analysis pipeline from a classification to a regression task is to adapt the estimator accordingly:

from sklearn.linear_model import LinearRegression
pipe = make_pipeline(
       StandardScaler(),
       LinearRegression())

A bit of information about regression

  • modelling the relationship between a scalar response and one or more explanatory variables

logo

Pros

  • simple implementation, efficient & fast

  • good performance in linear separable datasets

  • can address overfitting via regularization

Cons

  • prone to underfitting

  • outlier sensitivity

  • assumption of independence

  • the rest of the workflow is almost identical to the classification approach

  • after splitting the data into train and test sets:

X_train, X_test, y_train, y_test = train_test_split(data, Y_con, random_state=0)
  • we fit the pipeline:

pipe.fit(X_train, y_train)
Pipeline(steps=[('standardscaler', StandardScaler()),
                ('linearregression', LinearRegression())])
  • whose predictive performance can then be evaluated:

from sklearn.metrics import mean_absolute_error

print('mean absolute error in years: %s against a data distribution from %s to %s years' %(mean_absolute_error(pipe.predict(X_test), y_test), Y_con.min(), Y_con.max()))                                                                                         
mean absolute error in years: 4.116128254997565 against a data distribution from 3.518138261 to 39.0 years

Question: Is this good or bad?
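One way to ground that question: compare against a baseline that ignores X entirely. A sketch with scikit-learn's DummyRegressor on stand-in ages (uniformly drawn over our observed range, purely illustrative — the real Age distribution is skewed toward children):

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error

# stand-in ages roughly spanning our observed range (3.5 to 39 years)
rng = np.random.default_rng(0)
y = rng.uniform(3.5, 39.0, size=155)
X = np.zeros((155, 1))  # features are irrelevant for this baseline

# always predicting the mean age gives the MAE any real model has to beat
dummy = DummyRegressor(strategy='mean')
dummy.fit(X, y)
print(mean_absolute_error(y, dummy.predict(X)))
```

An MAE of ~4 years is only “good” to the extent that it clearly undercuts this kind of no-information baseline.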

Having taken a look at classification and regression via common models for each, we will devote some time to two other prominent models that can be applied to both tasks. (For the sake of completeness, please note that SVMs can also be utilized for regression tasks, changing from a support vector classifier to a support vector regression.)

Supervised learning - nearest neighbor

logo
  • non-parametric method

    • distribution-free or
      specific distribution + unspec. parameters

  • classification and regression

    • class or object property value

  • k-nearest neighbors

    • sensitive to local structure of data

Pros

  • intuitive and simple

  • no assumptions

  • one hyperparameter

  • variety of distance parameters

Cons

  • slow and sensitive to outliers

  • curse of dimensionality

  • requires homogeneous features and works best with balanced classes

  • how to determine k

  • as before, changing our pipeline to use k-nearest neighbors (kNN) as the estimator is very easy

  • we just need to import the respective class and put it into our pipeline:

from sklearn.neighbors import KNeighborsClassifier
pipe = make_pipeline(
       StandardScaler(),
       KNeighborsClassifier())
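As for the “how to determine k” question from the cons above, a common answer is cross-validation. A self-contained sketch using GridSearchCV on synthetic data (not our dataset — just to show the mechanics):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in data
X, y = make_classification(n_samples=150, n_features=20, random_state=0)

pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

# 'kneighborsclassifier__n_neighbors' addresses the kNN step inside
# the pipeline by its auto-generated name
grid = GridSearchCV(pipe,
                    {'kneighborsclassifier__n_neighbors': [1, 3, 5, 7, 9]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_)
```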
  • given kNN can tackle both classification and regression tasks, we will actually do both and compare the outcomes to the results we got before using the other estimators

  • let’s start with classification for which we need our categorical labels:

Y_cat.describe()
count       155
unique        2
top       child
freq        122
Name: Child_Adult, dtype: object
  • by now you know the rest, we divide into train and test set, followed by fitting our analysis pipeline and then testing its predictive performance

  • to ease up the comparison with the SVM, we will pack things into a small for-loop, iterating over the two different pipelines

X_train, X_test, y_train, y_test = train_test_split(data, Y_cat, random_state=0)

pipe_knn = make_pipeline(
           StandardScaler(),
           KNeighborsClassifier(n_neighbors=3))

pipe_svc = make_pipeline(
           StandardScaler(),
           SVC())

for pipeline, name in zip([pipe_svc, pipe_knn], ['SVC', 'kNN']):
    pipeline.fit(X_train, y_train)
    print('accuracy for %s is %s with chance level being %s' 
          %(name, accuracy_score(pipeline.predict(X_test), y_test), 1/len(pd.unique(Y_cat))))
accuracy for SVC is 0.8974358974358975 with chance level being 0.5
accuracy for kNN is 0.8717948717948718 with chance level being 0.5
  • how about the regression task?

from sklearn.neighbors import KNeighborsRegressor
X_train, X_test, y_train, y_test = train_test_split(data, Y_con, random_state=0)

pipe_knn = make_pipeline(
           StandardScaler(),
           KNeighborsRegressor(n_neighbors=3))

pipe_reg = make_pipeline(
           StandardScaler(),
           LinearRegression())

for pipeline, name in zip([pipe_reg, pipe_knn], ['Reg', 'kNN']):
    pipeline.fit(X_train, y_train)
    print('mean absolute error for %s in years: %s against a data distribution from %s to %s years' 
          %(name, mean_absolute_error(pipeline.predict(X_test), y_test), Y_con.min(), Y_con.max())) 
mean absolute error for Reg in years: 4.116128254997565 against a data distribution from 3.518138261 to 39.0 years
mean absolute error for kNN in years: 4.0175096672735044 against a data distribution from 3.518138261 to 39.0 years

Question for both tasks: which estimator do you choose and why?

logo

https://c.tenor.com/yGhUqB860GgAAAAC/worriedface.gif

Last but not least, another very popular model: tree-ensembles

Supervised learning - tree-ensembles

logo
  • e.g. Random forest

  • classification and regression

    • class selected by most trees or
      mean/average prediction

  • utilization of entire dataset or subsets thereof

    • bagging or bootstrapping

Pros

  • reduces overfitting in decision trees

  • tends to improve accuracy

  • addresses missing values

  • scaling of input not required

Cons

  • expensive regarding computational resources and training time

  • reduced interpretability

  • small changes in the data can lead to drastic changes in the trees

  • now that we’ve heard about it, we’re going to put it to work

  • comparable to the nearest neighbors model, we’ll check it out for both classification and regression tasks

  • we will also compare it to the other models

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
  • at first, within a classification task:

X_train, X_test, y_train, y_test = train_test_split(data, Y_cat, random_state=0)

pipe_rfc = make_pipeline(
           StandardScaler(),
           RandomForestClassifier(random_state=0))

pipe_knn = make_pipeline(
           StandardScaler(),
           KNeighborsClassifier(n_neighbors=3))

pipe_svc = make_pipeline(
           StandardScaler(),
           SVC())

for pipeline, name in zip([pipe_svc, pipe_knn, pipe_rfc], ['SVC', 'kNN', 'RFC']):
    pipeline.fit(X_train, y_train)
    print('accuracy for %s is %s with chance level being %s' 
          %(name, accuracy_score(pipeline.predict(X_test), y_test), 1/len(pd.unique(Y_cat))))
accuracy for SVC is 0.8974358974358975 with chance level being 0.5
accuracy for kNN is 0.8717948717948718 with chance level being 0.5
accuracy for RFC is 0.9487179487179487 with chance level being 0.5

Oooooh damn, it gets better and better: we nearly got a perfect accuracy score. I can already see our Nature publication being accepted…

logo

https://c.tenor.com/wyaFBOMEuskAAAAC/curious-monkey.gif

Maybe it does comparably well within the regression task? Only one way to find out…

X_train, X_test, y_train, y_test = train_test_split(data, Y_con, random_state=0)

pipe_rfc = make_pipeline(
           StandardScaler(),
           RandomForestRegressor(random_state=0))

pipe_knn = make_pipeline(
           StandardScaler(),
           KNeighborsRegressor(n_neighbors=3))

pipe_reg = make_pipeline(
           StandardScaler(),
           LinearRegression())

for pipeline, name in zip([pipe_reg, pipe_knn, pipe_rfc], ['Reg', 'kNN', 'RFC']):
    pipeline.fit(X_train, y_train)
    print('mean absolute error for %s in years: %s against a data distribution from %s to %s years' 
          %(name, mean_absolute_error(pipeline.predict(X_test), y_test), Y_con.min(), Y_con.max())) 
mean absolute error for Reg in years: 4.116128254997565 against a data distribution from 3.518138261 to 39.0 years
mean absolute error for kNN in years: 4.0175096672735044 against a data distribution from 3.518138261 to 39.0 years
mean absolute error for RFC in years: 3.446379857440512 against a data distribution from 3.518138261 to 39.0 years

Won’t you look at that? We got half a year better…nice!

However, what do you think about it?

Now that we’ve spent a fair amount of time evaluating how we can use the information we already have (labels) to predict a given outcome (Y), we will have a look at the things we can learn from the data (X) without using labels.

Unsupervised learning

logo
  • goal: extract information about X

  • as mentioned before, within unsupervised learning problems, we have two task types

    • decomposition & dimension reduction: PCA, ICA

    • clustering: kmeans, hierarchical clustering

  • comparable to the supervised learning section, we will try each and check what hidden treasures we might discover in our dataset (X)

Unsupervised learning - decomposition & dimensionality reduction

logo
  • goal: extract information about X

Unsupervised learning - PCA

logo

Pros

  • remove correlated features

  • improve performance

  • reduce overfitting

Cons

  • less interpretable

  • scaling required

  • some information lost

Excited about the PCAs of our X? We too!

In general, the analysis pipeline and setup don’t differ that much between supervised and unsupervised learning. At first we need to import the class(es) we need:

from sklearn.decomposition import PCA

Next, we need to set up our estimator, the PCA, defining how many components we want to compute/obtain. For the sake of simplicity, we will use 2.

pipe_pca = make_pipeline(
           StandardScaler(),
           PCA(n_components=2))

With that, we can already fit it to our X, saving the output to a new variable, which will be a decomposed/dimensionality reduced version of our input X:

data_pca = pipe_pca.fit_transform(data)

We can now evaluate the components:

data_pca.shape
(155, 2)

Question: What does this represent, i.e. can you explain what the different dimensions are?
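A hint that helps answer this question: each column is the projection of one participant onto a principal component, and the fitted PCA step reports how much variance each component explains. A self-contained sketch on random stand-in data (so the actual ratios here carry no meaning):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(155, 2016))  # stand-in for our connectivity features

pipe_pca = make_pipeline(StandardScaler(), PCA(n_components=2))
X_pca = pipe_pca.fit_transform(X)

# one row per participant, one column per principal component
print(X_pca.shape)
# fraction of the total variance each component captures;
# pipe_pca[-1] addresses the fitted PCA step of the pipeline
print(pipe_pca[-1].explained_variance_ratio_)
```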

We can also plot our components and factor in our labels again to check if, for example, the two components we obtained distinguish age-related variables we tried to predict in the supervised learning examples:

information.head(n=5)
participant_id Age AgeGroup Child_Adult Gender Handedness
0 sub-pixar123 27.06 Adult adult F R
1 sub-pixar124 33.44 Adult adult M R
2 sub-pixar125 31.00 Adult adult M R
3 sub-pixar126 19.00 Adult adult F R
4 sub-pixar127 23.00 Adult adult F R

How about the categorical variable Child_Adult?

labels = {
    str(i): f"PC {i+1} ({var:.1f}%)"
    for i, var in enumerate(pipe_pca[1].explained_variance_ratio_ * 100)
}

fig = px.scatter_matrix(
    data_pca,
    labels=labels,
    dimensions=range(2),
    color=information["Child_Adult"]
)
fig.update_traces(diagonal_visible=False)

# fig.show()

init_notebook_mode(connected=True)

plot(fig, filename = 'pca.html')
display(HTML('pca.html'))

Not a “perfect” fit, but it definitely looks like the PCA was able to extract important components of our data that nicely separate our groups.

We could now work further with our components, e.g. staying in the realm of dimensionality reduction and using them as X within a supervised learning approach, or evaluating them further to test whether they also separate more fine-grained classes in our dataset like AgeGroup or even Age.

However, given our unfortunate time constraints, we will continue with the next decomposition/dimensionality reduction approach: ICA.

Unsupervised learning - ICA

logo

Pros

Cons

Alrighty, let’s see how it performs on our dataset!

You guessed right, we need to import it first:

from sklearn.decomposition import FastICA

The rest works as with the PCA: we define our analysis pipeline

pipe_ica = make_pipeline(
           StandardScaler(),
           FastICA(n_components=2))

and fit it to our dataset:

data_ica = pipe_ica.fit_transform(data)

Coolio! As with PCA, we obtain two components:

data_ica.shape
(155, 2)

However, this time the components are statistically independent rather than orthogonal.

Any guesses on how things might look? We can easily check that out.

Question: When would you apply PCA and when ICA?

Decomposition & dimensionality reduction is quite fun, isn’t it? Do you think the second set of unsupervised learning tasks, i.e. clustering can beat that? Only one way to find out…

Unsupervised learning - clustering

logo
  • goal: extract information about X

We saw that we can use decomposition and dimensionality reduction approaches to unravel important dimensions of our data X. But can we also discover a certain structure in an unsupervised learning approach? That is, would it be possible to divide our dataset X into groups or clusters? We will employ two approaches, kmeans and hierarchical clustering, to find out!

Unsupervised learning - kmeans

logo

Pros

Cons

Now it’s time to test it on our dataset. After importing the class:

from sklearn.cluster import KMeans

we add it to our pipeline and apply it:

pipe_kmeans = make_pipeline(
           StandardScaler(),
           KMeans(n_clusters=2))

pipe_kmeans.fit(data)
Pipeline(steps=[('standardscaler', StandardScaler()),
                ('kmeans', KMeans(n_clusters=2))])
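The fitted pipeline only becomes interesting once we inspect the cluster assignments, e.g. via fit_predict. A self-contained sketch on random stand-in data (our real X would go in its place):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(155, 2016))  # stand-in for our connectivity features

# n_init set explicitly, as its default differs across sklearn versions
pipe_kmeans = make_pipeline(StandardScaler(),
                            KMeans(n_clusters=2, n_init=10, random_state=0))
cluster_labels = pipe_kmeans.fit_predict(X)

# one cluster assignment (0 or 1) per participant
print(cluster_labels.shape, np.unique(cluster_labels))
```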

Unsupervised learning - hierarchical clustering

logo

Pros

Cons

Well well well, how will hierarchical clustering perform in our dataset X?

from sklearn.cluster import AgglomerativeClustering
pipe_clust = make_pipeline(
              StandardScaler(),
              AgglomerativeClustering(n_clusters=2))

pipe_clust.fit(data)
Pipeline(steps=[('standardscaler', StandardScaler()),
                ('agglomerativeclustering', AgglomerativeClustering())])
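To judge whether the discovered clusters line up with a known grouping such as Child_Adult, the adjusted Rand index is a common choice (1 = perfect agreement, ~0 = chance level). A self-contained sketch on random stand-in data and labels with our class counts:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(155, 2016))                          # stand-in features
true_groups = np.array(['child'] * 122 + ['adult'] * 33)  # stand-in labels

pipe_clust = make_pipeline(StandardScaler(),
                           AgglomerativeClustering(n_clusters=2))
cluster_labels = pipe_clust.fit_predict(X)

# expect a value near 0 here, since the stand-in data has no real structure
print(adjusted_rand_score(true_groups, cluster_labels))
```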